Electricity Price Forecasts in Spain

Exploratory Data Analysis (EDA & Data Preparation)

Comments:

From the Pandas profiling report, we can see that the dataset has 10 variables, 9 of them numeric and 1 categorical. The total number of observations is 32135, with 26 missing values.

Interactions:

Pearson's Correlation:

Phik Correlation: The Phik correlation showed results similar to Pearson's correlation, with some exceptions for solar energy:

Null Values Treatment

Outlier Treatment

1. Outlier Removal

2. Outlier Clipping
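As an illustration of the clipping option, here is a minimal sketch that caps values at Tukey fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) instead of dropping rows. The fence rule, the multiplier `k`, the function name `clip_outliers_iqr` and the sample prices are all assumptions made for this example; the notebook may use different thresholds.

```python
from statistics import quantiles


def clip_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper).

    Unlike outlier removal, the number of observations is unchanged,
    which keeps an hourly time series free of gaps.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Made-up prices with one extreme spike at the end.
prices = [40.0, 42.0, 45.0, 47.0, 50.0, 52.0, 55.0, 180.0]
clipped = clip_outliers_iqr(prices)
print(clipped)
```

Clipping preserves the row count, whereas removal shrinks the dataset, which is one reason to try both options.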

Feature Engineering

Comments

From the Pandas profiling report, we can see that, after feature engineering, the dataset has 19 variables, 14 of them numeric and 5 categorical. The total number of observations is now 28048, with 0 missing values.

Interactions:

Pearson's Correlation:

Data Standardization (if required)
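A minimal sketch of z-score standardization, assuming the scaling parameters are estimated on the training portion only and then reused on later data, so that no information from the test period leaks into training. The helper names and sample values are hypothetical. The population standard deviation is used here, which is also what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Return (mean, population std) estimated on training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously fitted parameters."""
    return [(v - mu) / sigma for v in values]


# Made-up feature values: fit on the training part, reuse on the test part.
train = [10.0, 20.0, 30.0, 40.0]
test = [50.0]
mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)
print(train_z, test_z)
```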

Feature Selection

Based on the previous analysis of the correlation matrix, we decided to use ['Thermal_Gap', 'fc_wind', 'fc_solar', 'BoT_FR', 'holiday', 'hour', 'weekend', 'thermal_gap_daily', 'covid_period', 'hard_lockdown'] as preliminary features for our model.

Time Series Dataset Splitting (train/test)

1. Test Set - Using the last 20% of the dataset
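A minimal sketch of this chronological holdout, assuming the rows are already sorted by timestamp. The function name `chronological_split` and the toy data are hypothetical. The data is never shuffled: the first 80% is used for training and the most recent 20% for testing.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split a time-ordered sequence into (train, test) without shuffling."""
    n_test = int(round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


# Toy example: ten hourly observations, oldest first.
hours = list(range(10))
train_rows, test_rows = chronological_split(hours)
print(train_rows, test_rows)
```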

2. Test Set - Using TimeSeries Split

3. Test Set - Using Block TimeSeriesSplit
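A blocked split can be sketched as follows. This is one common way to write it, not necessarily the notebook's implementation: the class name `BlockingTimeSeriesSplit`, the `train_fraction` parameter and the 80/20 cut inside each block are assumptions. The class mirrors the `split` / `get_n_splits` interface of scikit-learn splitters. Unlike an expanding-window split, each fold uses its own disjoint block of consecutive observations, so no observation appears in more than one fold.

```python
class BlockingTimeSeriesSplit:
    """Split a time-ordered dataset into disjoint consecutive blocks.

    Within each block, the earliest part is used for training and the
    remainder for testing (hypothetical sketch).
    """

    def __init__(self, n_splits=5, train_fraction=0.8):
        self.n_splits = n_splits
        self.train_fraction = train_fraction

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        block_size = n_samples // self.n_splits
        for i in range(self.n_splits):
            start = i * block_size
            stop = start + block_size
            mid = start + int(self.train_fraction * block_size)
            # Train on the first part of the block, test on the rest.
            yield list(range(start, mid)), list(range(mid, stop))


# Toy example: 100 time-ordered observations, 5 blocks of 20.
btss = BlockingTimeSeriesSplit(n_splits=5)
folds = list(btss.split(list(range(100))))
for train_idx, test_idx in folds:
    print(train_idx[0], train_idx[-1], test_idx[0], test_idx[-1])
```

Any observations left over when the sample size is not divisible by `n_splits` are simply not used in this sketch.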

Models

1. LR Baseline

2. LR Baseline (TSS)

3. LR Baseline (BTSS)

Comments: It appears that both the TSS and BTSS splits perform better than a single non-random train/test split. Given our discoveries regarding the mechanisms behind electricity price forecasts, this is not surprising and validates our initial findings.

4. Ridge Linear Regression (TSS)

4.1. Standardized Variables (in order to allow for regression coefficient comparison)
4.2. Non-Standardized Variables

5. Ridge Linear Regression (BTSS)

5.1. Standardized Variables (in order to allow for regression coefficient comparison)
5.2. Non-Standardized Variables

6. Lasso Linear Regression (TSS)

6.1. Standardized Variables (in order to allow for regression coefficient comparison)
6.2. Non-Standardized Variables

7. Lasso Linear Regression (BTSS)

7.1. Standardized Variables (in order to allow for regression coefficient comparison)
7.2. Non-Standardized Variables

8. ARIMA Model

Running & Cross Validating ARIMA (Long computation and not essential for this exercise. Mean RMSE ± 2.0)
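For reference, a mean RMSE across cross-validation folds can be computed along these lines. This is a generic sketch that works with any model's predictions. The helper names and the fold values are made up for illustration and are not the notebook's ARIMA results.

```python
from math import sqrt
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def mean_cv_rmse(folds):
    """Average the RMSE over a list of (actual, predicted) fold pairs."""
    return mean(rmse(actual, predicted) for actual, predicted in folds)


# Made-up fold results: (actual prices, predicted prices).
example_folds = [
    ([50.0, 52.0, 48.0], [48.0, 52.0, 50.0]),
    ([40.0, 41.0], [42.0, 39.0]),
]
print(mean_cv_rmse(example_folds))
```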

9. Random Forest

9.1. Random Forest (TSS)
9.2. Random Forest (BTSS)

Cross Validation & Model Fine Tuning

Tentative Optimal Fine Tuning (cv = btss: extremely heavy computation)

10. Support Vector Regressor (SVR)

Note: Requires data to be standardized.

10.1. SVR (TSS)

Cross Validation & Model Fine Tuning

10.2. SVR (BTSS)

11. Decision Tree

11.1. Decision Tree (TSS)
11.2. Decision Tree (BTSS)

Cross Validation & Model Fine Tuning

12. Gradient Boosted Trees

12.1. Gradient Boosted Trees (TSS)
12.2. Gradient Boosted Trees (BTSS)

Cross Validation & Model Fine Tuning

13. XGBoost

13.1. XGBoost (TSS)

13.2. XGBoost (BTSS)

Top 3 - Candidate Models:

Obtaining Predictions

Loading & Transforming Scoring Dataset

Making Predictions

Appendix

Price Lag Analysis (Example Only - Analysis can be replicated with the Notebook)

Emphasizing Covid Period (Training and Test Sets during Covid Times exclusively)

Emphasizing Covid Period (Training Set September 2019 - September 2020)

Imputation w/ Linear Regression